CBOW (Continuous Bag of Words) News Embedding#
This notebook walks through a CBOW embedding implementation, using Word2Vec, to analyze news text. CBOW is one of the two Word2Vec architectures: it predicts a target word from the context words around it.
Objectives:#
Build embedding vectors for the words in a news dataset
Train Word2Vec with the CBOW architecture
Extract numeric features from the text for further analysis
1. Installing Libraries#
Install the required libraries:
plotly: for interactive visualization
gensim: the core library for Word2Vec and embeddings
%%capture
!pip install plotly
!pip install --upgrade gensim
2. Importing Libraries and Loading Data#
Import the required libraries and load the preprocessed news dataset:
gensim.models: Word2Vec and FastText
pandas: data manipulation
sklearn.decomposition.PCA: dimensionality reduction
matplotlib and plotly: visualization
numpy: numeric operations
from gensim.models import Word2Vec, FastText
import pandas as pd
import re
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
import plotly.graph_objects as go
import numpy as np
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('hasil_preprocessing_berita.csv')
df
| isi | hasil_preprocessing | kategori | |
|---|---|---|---|
| 0 | TNImasih mempertimbangkan langkah hukum yang a... | ['tnimasih', 'timbang', 'langkah', 'hukum', 'a... | nasional |
| 1 | Politikus Partai GerindraRahayu Saraswati Djoj... | ['politikus', 'partai', 'gerindrarahayu', 'sar... | nasional |
| 2 | Staf Khusus Gubernur DKI Jakarta Bidang Komuni... | ['staf', 'khusus', 'gubernur', 'dki', 'jakarta... | nasional |
| 3 | Politikus Partai Gerindrayang juga keponakan P... | ['politikus', 'partai', 'gerindrayang', 'kepon... | nasional |
| 4 | Keponakan Presiden Prabowo Subianto,Rahayu Sar... | ['keponakan', 'presiden', 'prabowo', 'subianto... | nasional |
| ... | ... | ... | ... |
| 146 | DKI evaluasi cakupanimunisasicampak hingga tin... | ['dki', 'evaluasi', 'cakupanimunisasicampak', ... | gaya-hidup |
| 147 | Situasi negara saat ini tak pelak bikinstres. ... | ['situasi', 'negara', 'pelak', 'bikinstres', '... | gaya-hidup |
| 148 | Banyak pakarkesehatanmenganjurkan konsumsisayu... | ['pakarkesehatanmenganjurkan', 'konsumsisayura... | gaya-hidup |
| 149 | MaskapaiRyanair mengimbau para penumpang yang ... | ['maskapairyanair', 'imbau', 'tumpang', 'alami... | gaya-hidup |
| 150 | Mengenal gejala dan penanganandiabetespada ana... | ['kenal', 'gejala', 'penanganandiabetespada', ... | gaya-hidup |
151 rows × 3 columns
3. Custom Class Definitions#
MyTokenizer#
A class for simple text tokenization:
Converts text to lowercase
Splits words on whitespace
MeanEmbeddingVectorizer#
A class that turns text into embedding vectors:
Uses a trained Word2Vec model
Averages the word vectors of each document
Handles out-of-vocabulary words (falls back to a zero vector)
4. Text Preprocessing#
Clean the news text by:
Converting to lowercase: normalizes the text format
Removing punctuation: strips punctuation and non-alphabetic characters
Removing HTML tags: cleans out any HTML markup
Removing digits and special characters: drops numbers and remaining non-alphabetic characters
The preprocessed result is stored in the 'clean' column.
import numpy as np
class MyTokenizer:
def fit_transform(self, texts):
        # Simple tokenization: lowercase + split
return [str(text).lower().split() for text in texts]
class MeanEmbeddingVectorizer:
def __init__(self, word2vec_model):
self.word2vec = word2vec_model
        # Use vector_size (the attribute name in Gensim ≥ 4.0)
self.dim = word2vec_model.wv.vector_size
def fit(self, X, y=None):
return self
def transform(self, X):
X_tokenized = MyTokenizer().fit_transform(X)
embeddings = []
for words in X_tokenized:
            # Take vectors only for words present in the vocabulary
valid_vectors = [
self.word2vec.wv[word] for word in words
if word in self.word2vec.wv
]
if valid_vectors:
embeddings.append(np.mean(valid_vectors, axis=0))
else:
embeddings.append(np.zeros(self.dim))
return np.array(embeddings)
def fit_transform(self, X, y=None):
return self.transform(X)
5. Building the Corpus and Training Word2Vec#
Building the Corpus#
Split each cleaned text into a list of words
Each document becomes its own list of tokens
Training the Word2Vec Model#
Architecture: CBOW (the Word2Vec default, sg=0)
min_count=1: include every word, even those that occur only once
vector_size=56: 56-dimensional embedding vectors
The model learns a vector representation for each word from its surrounding context
clean_txt = []
for w in range(len(df['hasil_preprocessing'])):
desc = str(df['hasil_preprocessing'][w]).lower()
#remove punctuation
desc = re.sub('[^a-zA-Z]', ' ', desc)
#remove tags
desc=re.sub("</?.*?>"," <> ",desc)
#remove digits and special chars
desc=re.sub("(\\d|\\W)+"," ",desc)
clean_txt.append(desc)
df['clean'] = clean_txt
df.head()
| isi | hasil_preprocessing | kategori | clean | |
|---|---|---|---|---|
| 0 | TNImasih mempertimbangkan langkah hukum yang a... | ['tnimasih', 'timbang', 'langkah', 'hukum', 'a... | nasional | tnimasih timbang langkah hukum ambil ceo mala... |
| 1 | Politikus Partai GerindraRahayu Saraswati Djoj... | ['politikus', 'partai', 'gerindrarahayu', 'sar... | nasional | politikus partai gerindrarahayu saraswati djo... |
| 2 | Staf Khusus Gubernur DKI Jakarta Bidang Komuni... | ['staf', 'khusus', 'gubernur', 'dki', 'jakarta... | nasional | staf khusus gubernur dki jakarta bidang komun... |
| 3 | Politikus Partai Gerindrayang juga keponakan P... | ['politikus', 'partai', 'gerindrayang', 'kepon... | nasional | politikus partai gerindrayang keponakan presi... |
| 4 | Keponakan Presiden Prabowo Subianto,Rahayu Sar... | ['keponakan', 'presiden', 'prabowo', 'subianto... | nasional | keponakan presiden prabowo subiantorahayu sar... |
6. Exploring the Word2Vec Model#
Word Similarity Analysis#
most_similar(): finds the words most similar to a probe word
most_similar_cosmul(): finds words similar to a combination of positive and negative words
doesnt_match(): finds the word that does not belong in a group of words
Saving the Embeddings#
Save the embedding vectors in Word2Vec format
File: berita_embd.txt (text format, not binary)
df.shape
(151, 4)
7. Extracting Document Embeddings#
Use MeanEmbeddingVectorizer to turn each document into a vector:
Input: the cleaned document text
Process:
Tokenize the text into words
Look up the embedding vector of each word in the Word2Vec model
Average the word vectors to obtain the document representation
Output: a 56-dimensional vector for each document
corpus = []
for col in df.clean:
    # Note: split(" ") keeps empty-string tokens produced by repeated spaces;
    # split() with no argument would drop them
    word_list = col.split(" ")
    corpus.append(word_list)
# show the first document's token list
corpus[0:1]
#generate vectors from corpus
model = Word2Vec(corpus, min_count=1, vector_size = 56)
8. Validating the Embeddings#
Check the embedding lengths for consistency:
Every document must have a vector of length 56 (matching vector_size)
This confirms the embedding step ran correctly
# Explore embeddings safely using an in-vocabulary token
# Pick a common Indonesian token if available, else fallback to the first vocab token
candidate_tokens = ['indonesia', 'pemerintah', 'jakarta', 'presiden', 'ekonomi']
probe = None
for tok in candidate_tokens:
if tok in model.wv:
probe = tok
break
if probe is None:
probe = model.wv.index_to_key[0]
print('Probe token:', probe)
print('Top similar:')
print(model.wv.most_similar(probe)[:10])
# Optional: cosine mul example if tokens exist
pos = [t for t in ['pemerintah', 'indonesia'] if t in model.wv]
neg = [t for t in ['oposisi'] if t in model.wv]
if pos:
print('Cosmul example:')
print(model.wv.most_similar_cosmul(positive=pos, negative=neg)[:10])
# Optional: doesnt_match example when enough tokens exist
cands = [t for t in ['ekonomi', 'politik', 'olahraga', 'jakarta'] if t in model.wv]
if len(cands) >= 3:
print('Odd-one-out:')
print(model.wv.doesnt_match(cands))
# Save embeddings
filename = 'berita_embd.txt'
model.wv.save_word2vec_format(filename, binary=False)
Probe token: indonesia
Top similar:
[('jalan', 0.994790256023407), ('persen', 0.9943721890449524), ('dukung', 0.9940369725227356), ('perintah', 0.9933809638023376), ('salah', 0.9932514429092407), ('orang', 0.9931586384773254), ('ekonomi', 0.9930185675621033), ('usaha', 0.9929376840591431), ('kali', 0.9924567341804504), ('purbaya', 0.9922690391540527)]
Cosmul example:
[('jalan', 0.9973941445350647), ('persen', 0.9971851110458374), ('dukung', 0.997017502784729), ('perintah', 0.99668949842453), ('salah', 0.996624767780304), ('orang', 0.9965783357620239), ('ekonomi', 0.9965083003044128), ('usaha', 0.9964678883552551), ('kali', 0.9962273836135864), ('purbaya', 0.99613356590271)]
Odd-one-out:
politik
9. Converting to a DataFrame#
Turn the embedding array into a DataFrame with one column per dimension:
Input: a 2D embedding array (151 documents × 56 features)
Process:
Create columns f1, f2, …, f56, one for each dimension
Fill each column with the values of the corresponding dimension
Output: a DataFrame with 151 rows and 56 feature columns
Purpose: to make further analysis and visualization easier
12. Visualizing the Embeddings#
Add visualizations to analyze the embedding results:
PCA Visualization: dimensionality reduction for a 2D view
Similarity Heatmap: similarity matrix between documents
Embedding Distribution: distribution of the embedding values
Category Analysis: embeddings analyzed per category
mean_embedding_vectorizer = MeanEmbeddingVectorizer(model)
mean_embedded = mean_embedding_vectorizer.fit_transform(df['clean'])
10. Adding Labels (Optional)#
Try to attach a label column if one is available:
Look for a 'kategori' column in the original DataFrame
If found, copy the labels into the embedding DataFrame
If not found, print a warning
Note: labels are needed for supervised learning and model evaluation.
df['array']=list(mean_embedded)
11. Final Results#
Process Summary:#
Preprocessing: clean the news text
Word2Vec training: build a CBOW model with 56 dimensions
Embedding extraction: turn documents into numeric vectors
DataFrame conversion: reshape the arrays into tabular form
Output:#
Embedding DataFrame: 151 rows × 56 feature columns
Embedding file: berita_embd.txt (Word2Vec format)
Word2Vec model: ready for word-similarity analysis
Next Applications:#
Document clustering
Text classification
Document similarity analysis
Embedding visualization with PCA/t-SNE
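As a sketch of the clustering application, K-Means can run directly on the document vectors. Random vectors stand in for `mean_embedded` here so the snippet is self-contained; in the notebook you would pass the real array, and `n_clusters=4` is an assumption to be tuned (e.g. via silhouette score):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(151, 56))  # stand-in for mean_embedded

# Cluster the 56-dim document vectors into 4 groups
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_vectors)

print(labels.shape)  # (151,) — one cluster id per document
```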
df.head(5)
| isi | hasil_preprocessing | kategori | clean | array | |
|---|---|---|---|---|---|
| 0 | TNImasih mempertimbangkan langkah hukum yang a... | ['tnimasih', 'timbang', 'langkah', 'hukum', 'a... | nasional | tnimasih timbang langkah hukum ambil ceo mala... | [-0.03424158, 0.023815494, 0.017993217, 0.0083... |
| 1 | Politikus Partai GerindraRahayu Saraswati Djoj... | ['politikus', 'partai', 'gerindrarahayu', 'sar... | nasional | politikus partai gerindrarahayu saraswati djo... | [-0.042456638, 0.030742344, 0.024186132, 0.011... |
| 2 | Staf Khusus Gubernur DKI Jakarta Bidang Komuni... | ['staf', 'khusus', 'gubernur', 'dki', 'jakarta... | nasional | staf khusus gubernur dki jakarta bidang komun... | [-0.035708997, 0.026341174, 0.019657917, 0.007... |
| 3 | Politikus Partai Gerindrayang juga keponakan P... | ['politikus', 'partai', 'gerindrayang', 'kepon... | nasional | politikus partai gerindrayang keponakan presi... | [-0.042115077, 0.03054803, 0.021272218, 0.0093... |
| 4 | Keponakan Presiden Prabowo Subianto,Rahayu Sar... | ['keponakan', 'presiden', 'prabowo', 'subianto... | nasional | keponakan presiden prabowo subiantorahayu sar... | [-0.04753819, 0.031029927, 0.026658049, 0.0126... |
df['embedding_length'] = df['array'].str.len()
print(df['embedding_length'])
0 56
1 56
2 56
3 56
4 56
..
146 56
147 56
148 56
149 56
150 56
Name: embedding_length, Length: 151, dtype: int64
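The same check can be turned into an assertion that fails loudly instead of relying on eyeballing the printed column. A sketch, with a list of zero vectors standing in for `df['array']`:

```python
import numpy as np

EXPECTED_DIM = 56  # must match the vector_size used at training time

# Stand-in for df['array']: one embedding vector per document
embeddings = [np.zeros(EXPECTED_DIM) for _ in range(151)]

lengths = {len(vec) for vec in embeddings}
assert lengths == {EXPECTED_DIM}, f"inconsistent embedding lengths: {lengths}"
print(f"all {len(embeddings)} documents have {EXPECTED_DIM}-dim embeddings")
```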
df.shape
(151, 6)
num_features = len(df['array'].iloc[0])  # assumes every list has the same length
columns = [f'f{i+1}' for i in range(num_features)]
# Initialize a dictionary to hold the data for each column
data_dict = {col: [] for col in columns}
# Loop over every row of the 'array' column
for embedding_list in df['array']:
    for i, value in enumerate(embedding_list):
        data_dict[f'f{i+1}'].append(value)
# Build a DataFrame from the dictionary
embedding_df = pd.DataFrame(data_dict)
embedding_df
| f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f47 | f48 | f49 | f50 | f51 | f52 | f53 | f54 | f55 | f56 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.034242 | 0.023815 | 0.017993 | 0.008343 | 0.001025 | -0.046393 | 0.012217 | -0.050655 | -0.031033 | -0.023668 | ... | 0.009383 | 0.025085 | -0.005328 | 0.019430 | 0.002036 | 0.035565 | 0.000871 | 0.007080 | -0.012785 | -0.023511 |
| 1 | -0.042457 | 0.030742 | 0.024186 | 0.011818 | -0.001329 | -0.062848 | 0.013246 | -0.070055 | -0.040546 | -0.035038 | ... | 0.013874 | 0.031744 | -0.007445 | 0.026577 | 0.001240 | 0.045626 | -0.004341 | 0.009710 | -0.017322 | -0.031579 |
| 2 | -0.035709 | 0.026341 | 0.019658 | 0.007437 | -0.001668 | -0.052641 | 0.012276 | -0.054184 | -0.033147 | -0.025083 | ... | 0.009727 | 0.025435 | -0.004278 | 0.020648 | 0.000802 | 0.037979 | -0.001430 | 0.008824 | -0.015628 | -0.025066 |
| 3 | -0.042115 | 0.030548 | 0.021272 | 0.009377 | -0.003075 | -0.056875 | 0.012645 | -0.056823 | -0.035181 | -0.028878 | ... | 0.010294 | 0.029869 | -0.008437 | 0.024573 | 0.004130 | 0.040517 | -0.000205 | 0.011971 | -0.014870 | -0.026872 |
| 4 | -0.047538 | 0.031030 | 0.026658 | 0.012629 | -0.001332 | -0.067935 | 0.015888 | -0.074749 | -0.044312 | -0.037645 | ... | 0.015583 | 0.034910 | -0.009198 | 0.030109 | 0.000316 | 0.052512 | -0.004867 | 0.009321 | -0.019838 | -0.035834 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146 | -0.040215 | 0.029681 | 0.021212 | 0.008749 | 0.000038 | -0.058004 | 0.013599 | -0.062187 | -0.037696 | -0.032407 | ... | 0.012856 | 0.029878 | -0.005884 | 0.026453 | 0.001851 | 0.043063 | -0.000105 | 0.010249 | -0.016052 | -0.028944 |
| 147 | -0.047534 | 0.034729 | 0.022151 | 0.012573 | -0.001855 | -0.061164 | 0.013678 | -0.063166 | -0.038975 | -0.029292 | ... | 0.011725 | 0.035503 | -0.007956 | 0.024593 | 0.005020 | 0.043391 | 0.001213 | 0.010205 | -0.013986 | -0.029877 |
| 148 | -0.052395 | 0.037619 | 0.026534 | 0.008568 | -0.003877 | -0.067647 | 0.015677 | -0.070051 | -0.040592 | -0.031019 | ... | 0.010935 | 0.040182 | -0.009143 | 0.030671 | 0.004978 | 0.050077 | 0.000574 | 0.014260 | -0.015499 | -0.032130 |
| 149 | -0.040506 | 0.029215 | 0.021303 | 0.007917 | -0.000835 | -0.056548 | 0.014421 | -0.057189 | -0.035690 | -0.030028 | ... | 0.012104 | 0.032007 | -0.006563 | 0.026857 | 0.002147 | 0.042390 | 0.000947 | 0.009823 | -0.013878 | -0.028389 |
| 150 | -0.044656 | 0.033810 | 0.019051 | 0.009090 | -0.000588 | -0.060648 | 0.013711 | -0.062389 | -0.037996 | -0.030327 | ... | 0.011028 | 0.033929 | -0.006889 | 0.027399 | 0.005729 | 0.041514 | -0.001705 | 0.010731 | -0.016428 | -0.031244 |
151 rows × 56 columns
# Add a label column if one is available in df
possible_labels = ['kategori']
label_col = None
for c in possible_labels:
if c in df.columns:
label_col = c
break
if label_col is not None:
embedding_df[label_col] = df[label_col].values
else:
    print('Warning: no label column found in df. Skipping label copy.')
embedding_df
| f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f48 | f49 | f50 | f51 | f52 | f53 | f54 | f55 | f56 | kategori | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.034242 | 0.023815 | 0.017993 | 0.008343 | 0.001025 | -0.046393 | 0.012217 | -0.050655 | -0.031033 | -0.023668 | ... | 0.025085 | -0.005328 | 0.019430 | 0.002036 | 0.035565 | 0.000871 | 0.007080 | -0.012785 | -0.023511 | nasional |
| 1 | -0.042457 | 0.030742 | 0.024186 | 0.011818 | -0.001329 | -0.062848 | 0.013246 | -0.070055 | -0.040546 | -0.035038 | ... | 0.031744 | -0.007445 | 0.026577 | 0.001240 | 0.045626 | -0.004341 | 0.009710 | -0.017322 | -0.031579 | nasional |
| 2 | -0.035709 | 0.026341 | 0.019658 | 0.007437 | -0.001668 | -0.052641 | 0.012276 | -0.054184 | -0.033147 | -0.025083 | ... | 0.025435 | -0.004278 | 0.020648 | 0.000802 | 0.037979 | -0.001430 | 0.008824 | -0.015628 | -0.025066 | nasional |
| 3 | -0.042115 | 0.030548 | 0.021272 | 0.009377 | -0.003075 | -0.056875 | 0.012645 | -0.056823 | -0.035181 | -0.028878 | ... | 0.029869 | -0.008437 | 0.024573 | 0.004130 | 0.040517 | -0.000205 | 0.011971 | -0.014870 | -0.026872 | nasional |
| 4 | -0.047538 | 0.031030 | 0.026658 | 0.012629 | -0.001332 | -0.067935 | 0.015888 | -0.074749 | -0.044312 | -0.037645 | ... | 0.034910 | -0.009198 | 0.030109 | 0.000316 | 0.052512 | -0.004867 | 0.009321 | -0.019838 | -0.035834 | nasional |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146 | -0.040215 | 0.029681 | 0.021212 | 0.008749 | 0.000038 | -0.058004 | 0.013599 | -0.062187 | -0.037696 | -0.032407 | ... | 0.029878 | -0.005884 | 0.026453 | 0.001851 | 0.043063 | -0.000105 | 0.010249 | -0.016052 | -0.028944 | gaya-hidup |
| 147 | -0.047534 | 0.034729 | 0.022151 | 0.012573 | -0.001855 | -0.061164 | 0.013678 | -0.063166 | -0.038975 | -0.029292 | ... | 0.035503 | -0.007956 | 0.024593 | 0.005020 | 0.043391 | 0.001213 | 0.010205 | -0.013986 | -0.029877 | gaya-hidup |
| 148 | -0.052395 | 0.037619 | 0.026534 | 0.008568 | -0.003877 | -0.067647 | 0.015677 | -0.070051 | -0.040592 | -0.031019 | ... | 0.040182 | -0.009143 | 0.030671 | 0.004978 | 0.050077 | 0.000574 | 0.014260 | -0.015499 | -0.032130 | gaya-hidup |
| 149 | -0.040506 | 0.029215 | 0.021303 | 0.007917 | -0.000835 | -0.056548 | 0.014421 | -0.057189 | -0.035690 | -0.030028 | ... | 0.032007 | -0.006563 | 0.026857 | 0.002147 | 0.042390 | 0.000947 | 0.009823 | -0.013878 | -0.028389 | gaya-hidup |
| 150 | -0.044656 | 0.033810 | 0.019051 | 0.009090 | -0.000588 | -0.060648 | 0.013711 | -0.062389 | -0.037996 | -0.030327 | ... | 0.033929 | -0.006889 | 0.027399 | 0.005729 | 0.041514 | -0.001705 | 0.010731 | -0.016428 | -0.031244 | gaya-hidup |
151 rows × 57 columns
embedding_df.shape
(151, 57)
# 1. PCA visualization of the embeddings
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Dimensionality reduction with PCA
pca = PCA(n_components=2)
embedding_2d = pca.fit_transform(embedding_df.iloc[:, :-1]) # Exclude kategori column
# Plot with Matplotlib
plt.figure(figsize=(12, 8))
categories = embedding_df['kategori'].unique()
colors = ['red', 'blue', 'green', 'orange', 'purple']
for i, category in enumerate(categories):
mask = embedding_df['kategori'] == category
plt.scatter(embedding_2d[mask, 0], embedding_2d[mask, 1],
c=colors[i % len(colors)], label=category, alpha=0.7, s=50)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.title('PCA Visualization of News Embeddings by Category')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Explained variance ratio: PC1={pca.explained_variance_ratio_[0]:.3f}, PC2={pca.explained_variance_ratio_[1]:.3f}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}")
Explained variance ratio: PC1=0.937, PC2=0.028
Total explained variance: 0.966
# 3. Similarity heatmap for a sample of documents
# Take a sample of 20 documents for the heatmap
sample_size = min(20, len(embedding_df))
sample_indices = np.random.choice(len(embedding_df), sample_size, replace=False)
sample_embeddings = embedding_df.iloc[sample_indices, :-1] # Exclude kategori
# Compute cosine similarity
similarity_matrix = cosine_similarity(sample_embeddings)
# Plot the heatmap with Matplotlib
plt.figure(figsize=(10, 8))
plt.imshow(similarity_matrix, cmap='viridis', aspect='auto')
plt.colorbar(label='Cosine Similarity')
plt.title('Cosine Similarity Matrix of News Embeddings (Sample)')
plt.xlabel('Document Index')
plt.ylabel('Document Index')
# Add category labels
categories_sample = embedding_df.iloc[sample_indices]['kategori'].values
for i, cat in enumerate(categories_sample):
plt.text(i, -0.5, cat[:3], rotation=45, ha='right', va='top', fontsize=8)
plt.tight_layout()
plt.show()
print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"Average similarity: {similarity_matrix.mean():.3f}")
print(f"Max similarity: {similarity_matrix.max():.3f}")
print(f"Min similarity: {similarity_matrix.min():.3f}")
Similarity matrix shape: (20, 20)
Average similarity: 0.997
Max similarity: 1.000
Min similarity: 0.990
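The off-diagonal similarities sit near 1.0, which suggests the document vectors all point in nearly the same direction — plausible here, since each one is a mean of word vectors drawn from the same vocabulary. One way to expose the finer differences is to subtract the mean document vector before computing cosine similarity. A sketch with synthetic near-parallel vectors standing in for the sample embeddings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
shared = rng.normal(size=56)
# Near-parallel vectors: one shared direction plus small per-document noise
X = shared + 0.05 * rng.normal(size=(20, 56))

raw_sim = cosine_similarity(X)
centered_sim = cosine_similarity(X - X.mean(axis=0))

off_diag = ~np.eye(20, dtype=bool)
print(round(float(raw_sim[off_diag].mean()), 3))       # close to 1.0
print(round(float(centered_sim[off_diag].mean()), 3))  # much lower
```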
# 4. Embedding distribution per category
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()
# Pick a few features to analyze
feature_cols = ['f1', 'f2', 'f3', 'f4']
for i, feature in enumerate(feature_cols):
for category in embedding_df['kategori'].unique():
data = embedding_df[embedding_df['kategori'] == category][feature]
axes[i].hist(data, alpha=0.6, label=category, bins=20)
axes[i].set_title(f'Distribution of {feature} by Category')
axes[i].set_xlabel(feature)
axes[i].set_ylabel('Frequency')
axes[i].legend()
axes[i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# 5. Word similarity analysis with Word2Vec
# Take a few words that exist in the vocabulary
vocab_words = list(model.wv.key_to_index.keys())[:20]  # take the first 20 words
# Compute the similarity matrix for these words
word_similarities = []
for word1 in vocab_words:
row = []
for word2 in vocab_words:
if word1 in model.wv and word2 in model.wv:
similarity = model.wv.similarity(word1, word2)
row.append(similarity)
else:
row.append(0)
word_similarities.append(row)
word_similarities = np.array(word_similarities)
# Plot the word-similarity heatmap
plt.figure(figsize=(12, 10))
plt.imshow(word_similarities, cmap='viridis', aspect='auto')
plt.colorbar(label='Word Similarity')
plt.title('Word Similarity Matrix (Word2Vec)')
plt.xlabel('Words')
plt.ylabel('Words')
# Set labels
plt.xticks(range(len(vocab_words)), vocab_words, rotation=45, ha='right')
plt.yticks(range(len(vocab_words)), vocab_words)
plt.tight_layout()
plt.show()
print(f"Vocabulary size: {len(model.wv.key_to_index)}")
print(f"Sample words: {vocab_words[:10]}")
Vocabulary size: 5847
Sample words: ['', 'iphone', 'indonesia', 'to', 'scroll', 'with', 'content', 'continue', 'advertisement', 'menteri']
# Test Plotly after installing nbformat
import plotly.express as px
import pandas as pd
import numpy as np
# Build simple test data
test_data = pd.DataFrame({
'x': np.random.randn(10),
'y': np.random.randn(10),
'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C']
})
# Test plotly
fig = px.scatter(test_data, x='x', y='y', color='category', title='Test Plotly')
fig.show()
print("✅ Plotly ran successfully! The nbformat error is resolved.")
✅ Plotly ran successfully! The nbformat error is resolved.
# Fix 2: reinstall the libraries from inside the notebook
import sys
!{sys.executable} -m pip install --upgrade nbformat ipython
Requirement already satisfied: nbformat in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (5.10.4)
Requirement already satisfied: ipython in c:\users\user\appdata\roaming\python\python311\site-packages (9.6.0)
Requirement already satisfied: fastjsonschema>=2.15 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from nbformat) (2.21.2)
Requirement already satisfied: jsonschema>=2.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from nbformat) (4.25.1)
Requirement already satisfied: jupyter-core!=5.0.*,>=4.12 in c:\users\user\appdata\roaming\python\python311\site-packages (from nbformat) (5.8.1)
Requirement already satisfied: traitlets>=5.1 in c:\users\user\appdata\roaming\python\python311\site-packages (from nbformat) (5.14.3)
Requirement already satisfied: colorama in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (0.4.6)
Requirement already satisfied: decorator in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (5.2.1)
Requirement already satisfied: ipython-pygments-lexers in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (1.1.1)
Requirement already satisfied: jedi>=0.16 in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.19.2)
Requirement already satisfied: matplotlib-inline in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.1.7)
Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (3.0.52)
Requirement already satisfied: pygments>=2.4.0 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (2.19.2)
Requirement already satisfied: stack_data in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.6.3)
Requirement already satisfied: typing_extensions>=4.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (4.15.0)
Requirement already satisfied: wcwidth in c:\users\user\appdata\roaming\python\python311\site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython) (0.2.14)
Requirement already satisfied: parso<0.9.0,>=0.8.4 in c:\users\user\appdata\roaming\python\python311\site-packages (from jedi>=0.16->ipython) (0.8.5)
Requirement already satisfied: attrs>=22.2.0 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (25.3.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (2025.9.1)
Requirement already satisfied: referencing>=0.28.4 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (0.36.2)
Requirement already satisfied: rpds-py>=0.7.1 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (0.27.1)
Requirement already satisfied: platformdirs>=2.5 in c:\users\user\appdata\roaming\python\python311\site-packages (from jupyter-core!=5.0.*,>=4.12->nbformat) (4.4.0)
Requirement already satisfied: pywin32>=300 in c:\users\user\appdata\roaming\python\python311\site-packages (from jupyter-core!=5.0.*,>=4.12->nbformat) (311)
Requirement already satisfied: executing>=1.2.0 in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (2.2.1)
Requirement already satisfied: asttokens>=2.1.0 in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (3.0.0)
Requirement already satisfied: pure-eval in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (0.2.3)
# Fix 3: try a different Plotly renderer
import plotly.io as pio
# Try several renderers in turn
try:
    # Renderer for Jupyter notebooks
    pio.renderers.default = "notebook"
    print("✅ Renderer set to 'notebook'")
except Exception:
    try:
        # Renderer for the browser
        pio.renderers.default = "browser"
        print("✅ Renderer set to 'browser'")
    except Exception:
        # HTML renderer
        pio.renderers.default = "html"
        print("✅ Renderer set to 'html'")
# Test with simple data
import plotly.express as px
import pandas as pd
import numpy as np
test_data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 1, 3, 5],
    'category': ['A', 'B', 'A', 'C', 'B']
})
fig = px.scatter(test_data, x='x', y='y', color='category', title='Test Plotly with the New Renderer')
fig.show()
✅ Renderer set to 'notebook'
# Fix for the similarity-heatmap error
# Import the required libraries
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import numpy as np
# Check whether embedding_df has been created yet
try:
    if 'embedding_df' not in locals():
        print("❌ Error: embedding_df has not been created. Run the previous cells first.")
    else:
        print(f"✅ embedding_df is available with shape: {embedding_df.shape}")
        # Check whether the category column exists
        if 'kategori' not in embedding_df.columns:
            print("❌ Error: column 'kategori' not found in embedding_df")
            print(f"Available columns: {list(embedding_df.columns)}")
        else:
            print("✅ Column 'kategori' is available")
except NameError as e:
    print(f"❌ Error: {e}")
    print("Make sure all previous cells have run successfully.")
✅ embedding_df is available with shape: (151, 57)
✅ Column 'kategori' is available
# Similarity-heatmap code, fixed and with error handling
def create_similarity_heatmap(embedding_df, sample_size=20):
    """
    Build a similarity heatmap with robust error handling.
    Parameters:
    - embedding_df: DataFrame containing the embeddings
    - sample_size: number of documents to sample for the heatmap (default: 20)
    """
    try:
        # Import the required libraries
        from sklearn.metrics.pairwise import cosine_similarity
        import matplotlib.pyplot as plt
        import numpy as np
        # Check that there is data
        if len(embedding_df) == 0:
            print("❌ Error: embedding_df is empty")
            return None
        # Identify the feature columns (exclude non-numeric columns)
        feature_cols = [col for col in embedding_df.columns if col.startswith('f')]
        if len(feature_cols) == 0:
            print("❌ Error: no feature columns (f1, f2, ...) found")
            return None
        print(f"✅ Found {len(feature_cols)} feature columns")
        # Sample documents
        sample_size = min(sample_size, len(embedding_df))
        sample_indices = np.random.choice(len(embedding_df), sample_size, replace=False)
        sample_embeddings = embedding_df.iloc[sample_indices][feature_cols]
        print(f"✅ Sampled {sample_size} documents for the heatmap")
        # Compute cosine similarity
        similarity_matrix = cosine_similarity(sample_embeddings)
        # Plot the heatmap
        plt.figure(figsize=(12, 10))
        plt.imshow(similarity_matrix, cmap='viridis', aspect='auto')
        plt.colorbar(label='Cosine Similarity')
        plt.title('Cosine Similarity Matrix of News Embeddings (Sample)')
        plt.xlabel('Document Index')
        plt.ylabel('Document Index')
        # Add category labels if available
        if 'kategori' in embedding_df.columns:
            categories_sample = embedding_df.iloc[sample_indices]['kategori'].values
            for i, cat in enumerate(categories_sample):
                plt.text(i, -0.5, str(cat)[:3], rotation=45, ha='right', va='top', fontsize=8)
            plt.text(0, -1.5, "Category:", fontsize=10, fontweight='bold')
        else:
            print("⚠️ Column 'kategori' not found; heatmap drawn without category labels")
        plt.tight_layout()
        plt.show()
        # Similarity statistics
        print(f"\n📊 Similarity matrix statistics:")
        print(f"  Shape: {similarity_matrix.shape}")
        print(f"  Average similarity: {similarity_matrix.mean():.3f}")
        print(f"  Max similarity: {similarity_matrix.max():.3f}")
        print(f"  Min similarity: {similarity_matrix.min():.3f}")
        # Similarity excluding the diagonal (self-similarity)
        mask = ~np.eye(similarity_matrix.shape[0], dtype=bool)
        off_diagonal_similarities = similarity_matrix[mask]
        print(f"  Average similarity (excluding diagonal): {off_diagonal_similarities.mean():.3f}")
        return similarity_matrix
    except Exception as e:
        print(f"❌ Error while building the similarity heatmap: {str(e)}")
        print("Make sure all libraries are installed and the data is ready")
        return None
# Run the function
print("🚀 Building the similarity heatmap...")
similarity_matrix = create_similarity_heatmap(embedding_df, sample_size=20)
🚀 Building the similarity heatmap...
✅ Found 56 feature columns
✅ Sampled 20 documents for the heatmap
📊 Similarity matrix statistics:
  Shape: (20, 20)
  Average similarity: 0.996
  Max similarity: 1.000
  Min similarity: 0.980
  Average similarity (excluding diagonal): 0.996